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We present an algorithm to automatically derive Feynman rules for lattice perturbation theory in 



SO background field gauge. Vertices with an arbitrary number of both background and quantum legs 



can be derived automatically from both gluonic and fermionic actions. The algorithm is a gen- 
eralisation of our earlier algorithm based on prior work by Luscher and Weisz. We also present 
i—i techniques allowing for the parallelisation of the evaluation of the often rather complex lattice 

Feynman rules that should allow for efficient implementation on GPUs, but also give a signif- 
t— I icant speed-up when calculating the derivatives of Feynman diagrams with respect to external 

momenta. 
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1. Automated Lattice Perturbation Theory 

While the lattice offers a fully nonperturbative regulator for Quantum Field Theory, perturba- 
tive calculations still play an important role in renormalising, matching and improving operators, 
both those appearing in the action and those used to measure observables. For the highly im- 
proved actions now widely used, such as the Highly Improved Staggered Quark (HISQ) action [1] 
or moving NRQCD, [2] manual derivation and implementation of the Feynman rules would be 
highly impractical. Automated methods are therefore required. A general algorithm to derive the 
Feynman rules for arbitrary traced closed Wilson loops was derived by Luscher and Weisz in [3]. 
We have generalised this algorithm to include fermionic fields [4] and have implemented it in the 
HiPPy/HPsrc packages [5] . 

1.1 The HiPPy package 

HiPPy is an is an automated tool for generating Feynman rules from arbitrary lattice actions, 
which is written entirely in Python [6] . Starting from the perturbative expansion U^(x) = e^° A ' i ( x+ 2 ) 
of the link variables, the action is expanded as 

JSf = Tt(U<#)+WU®Y 

r r 

with vertex functions 

V%::%(k u . ..-A,) Tri7<" • ••/"•; x ]>>'^ ' 

i 

v^ bc (k h ...,k r ; P ,q) = (/"•••/"•wx£./;r ( /'-' l ^ < ^ 

i 

giving the Feynman rules. 

The algorithm for achieving this expansion starts from the encoding of individual terms in the 
vertex function as so-called entities 

Et = (x i ,y i ;vi t i,...,Vi/,a i ) 
each of which carries an amplitude . The crucial property of entities is the multiplication rule 
EE' = (x.y' +c;v\ v r .v\+c v' + c; a k ) 

where c = x' — y and Cfy, is defined via r^r^ — tyajdX ctf Entities differing by only translations 
are equivalent by momentum conservation. Additional structure (e.g. a non-trivial colour struc- 
ture) can also be encoded by adding additional fields to the entity and amending the entity algebra 
accordingly. 

A field is then defined as a double mapping 

F = {(p. 1 ,...,n r )^{E i ^fi}} 
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which encodes a generic Wilson line, and multiplication of field objects is defined accordingly in 
terms of entity multiplication by 

FF' = { fa, . . . , Hr,Hu ■ ■ ■ , Hi) ^ i EE ' C„<jtff} } 

where C rs = (r + s) !/ (Hj!) is a combinatorial factor and <p is the phase from the multiplication of 
the spin matrices belonging to the entities E and E', as defined above. Addition of field objects is 
defined by the addition of the amplitudes belonging to the individual entities, with the amplitude 
in the sum of an entity present in only one of the summands being set to its amplitude in that 
summand. 

The basic building block from which smeared links, operators and actions are constructed in 
an iterative fashion is the simple link encoded as 

u^x) = e^ + f) = {( M) ..., M )^|(o,A;|,..,|;o)^i}}, 

and predefined building blocks (e.g. smeared links, covariant derivatives and field strength tensors) 
constructed from this are provided as part of HiPPy. 

1.2 The HPsrc package 

The HPsrc package consists of a suite of Fortran 95 modules complementing HiPPy. While 
the output of HiPPy is in principle suitable to being converted directly into an analytic form, this is 
not usually necessary or even useful in lattice perturbation theory. We therefore have implemented 
routines that read in the HiPPy-generated Feynman rules at runtime and use them to construct the 
vertex functions and propagators for given momenta on the fly. This also offers the great advantage 
of being able to write Feynman diagrams in an action-blind way, so as to be able to recompute 
the same quantities for another action without needing to write new code. HPsrc also provides 
facilities for automated differentiation of Feynman diagrams, so that neither analytic manipulations 
nor inaccurate numerical derivatives are needed. 

2. Incorporating Background Fields 

The background field technique has long been known as a valuable tool in field theory. In [7], 
Liischer and Weisz showed that the theorem about dimensional regularisation stating that renor- 
malisation of the effective action does not require additional counterterms beyond those needed 
for the renormalisation of the action holds also in the case of lattice gauge theory. This makes it 
possible to use the background field technique to perform calculations such as relating the bare 
lattice coupling to the MS coupling [8]. To determine the coefficient of the a • B term in the (mov- 
ing) NRQCD action [9], only the background field technique can guarantee the gauge invariance 
of higher-dimensional operators which will necessarily be generated in an effective theory such as 
(m)NRQCD. This makes it desirable to incorporate support for the background field method into 
the HiPPy/HPsrc packages. 
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2.1 Background fields in HiPPy 

We decompose our fields into a background field and quantum fluctuations by parametris- 
ing the basic gauge link as 

Un(x) =e ^(*+§W*+f) 

We note that this does not affect the combinations of x, y, v ; j that are possible, and that hence 
the entity algebra remains unaffected. However, the gluon fields living at each lattice point v can 
now be either quantum or background, with the quantum fields always appearing to the left of the 
background fields coming from the same link, and thus the field objects need to keep track of nature 
of individual gluon fields. 

This leads to a new mapping structure for field objects given now by 

F = {{^...^ r )^{E i ^{ff-\...jf- B )}} 

with an order-r entity being mapped to an 2 r -tuple of amplitudes. 
With this structure, multiplication of fields now assigns 

j?xi-x r yi-y s ._ c ^ jrxy- Xr j?yi~y s 
where C rs and are defined as before, and the simple link becomes 

c/^(^) = e^(-+#) = 1^,...,^)^ |(^o,A;§,...,|;o) ^ (1,...,/^ ^ , 

where 

jq s B r - s _ r - 

s\(r — s)\ 

and all other f x vanish identically. This definition maintains the ordering of the background and 
quantum fields throughout the expansion procedure. We note that this precludes performing any 
symmetrisation over the gluon fields at this stage, and that in background gauge all symmetrisation 
is to be deferred to the evaluation of the Feynman diagrams. 

2.2 Background field gauge in HPsrc 

In order to support background field calculations, we also need to extend HPsrc so that it 
supports fixing to background field gauge. 
The gauge fixing term for this is 

^/ = -iTr(D^) 2 



2^ 

with 



= [f(x)-e- B ^-^f(x-fi)e B ^-^ 

which gives rise to additional terms in all purely gluonic vertices with exactly two quantum fields. 
These terms have been implemented in HPsrc for the propagator and three-gluon vertex. 

Similarly, the Fadeev-Popov ghost action in background field gauge involves background co- 
variant derivatives instead of finite differences, and Ad(q^) instead of Ad(A^). 
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3. Accelerated Automatic Differentiation 

In HPsrc, we use TaylUR [10] for the automatic computation of derivatives with respect to 
an external momentum P. In this way, derivative information is automatically propagated from 
vertices and propagators to Feynman diagrams and quantities constructed therefrom, but in order to 
save computation time, the derivatives of the vertex functions themselves are constructed explicitly 
in the HPsrc code. 

For a momentum decomposed as kj = kj + rjP, the derivative of the reduced vertex reads 



We note that the flC • •) term above is momentum-independent, and needs to be computed only 
once in order to compute the derivatives for each momentum in a set of momenta with identical 
routings ry, only the exponential needs to be recomputed for each momentum. It is therefore 
possible to achieve a significant speed-up by momentum vectorisation, such that the vertex function 
is called with a vector of momenta with identical routings rj and returns a vector of values. 

4. GPU acceleration of Reduced Vertices 

The main work (about 85%) in a perturbative calculation using HPsrc comes from evaluating 
the reduced vertex. This function is particularly suitable for parallelisation using a General Purpose 
GPU (GPGPU). Data transfer times are small and the momentum vectorisation discussed above 
makes the overhead per momentum point negligible (as the monomial data can be reused). 

The reduced vertex routine was extracted as a separate kernel for testing. Derivatives of the 
reduced vertex were not initially calculated. The kernel consists of a two-level loop nest. The outer 
loop (over independent momentum points with loop index n) is trivially parallel. The inner loop 
(the sum over monomials with loop index i) contained some dependencies. We fixed the number 
of monomials at 8000 (representative of the HISQ action) and varied the number of points. 

The kernel was accelerated on an NVIDIA Fermi C2050 GPU in two ways. First, a corre- 
sponding CUDA kernel was written and called via a C wrapper. Secondly, the original Fortran90 
was accelerated using the directives implemented in the PGI compiler. 

Initially, only the outer loop was parallelised, explicitly in the CUDA and automatically by 
the PGI compiler. In both cases, GPU performance was significantly increased by reordering the k 
array so that index n was the fastest- varying in memory. This allowed "coalescing" of loads from 
the GPU global memory. 

For a wide range of problem sizes, the CUDA kernel (including data transfers) was around 
52 times faster than the best version of the kernel running on a single core of an Intel Nehalem 
processor. The PGI directive version was only 20% slower than the CUDA kernel. 

Further performance gains were obtained by restructuring the kernel so that loops over both 
n and i could be parallelised. When this was done in the PGI directive-accelerated kernel, the 
performance matched that of the original CUDA version. The same refactoring in the CUDA is 
currently in progress. Our results are summarised in Fig. 1. 

This work is now being ported into the main application. 
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Figure 1: Comparing the performance of the HPsrc kernel on an NVIDIA Fermi GPU for increasing num- 
bers of momentum points. 



5. Summary 

The HiPPy /HPsrc package has been extended to enable calculations in background field gauge. 
The new functionality will be used to calculate to @{a s ) corrections to the coefficients of the 
(m)NRQCD action. 

The re-use of common routing information enables a significant speed-up of calculations in- 
volving automatic differentiation of vertices. A further speed-up can be achieved by rewriting the 
generic vertex functions as CUDA kernels. An optimised implementation is in preparation. 
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